Many machine learning models are currently deployed to predict whether a person will default on a loan or pay it back. These decisions rely on predictive modelling over various attributes of the person, such as their education level, sex, personal status, checking account status, savings accounts and bonds, and many more. The aim of this project is to use one such publicly available dataset, the Statlog German Credit Risk dataset, which holds anonymized data on many customers of a bank: their personal details and whether they defaulted on their loan or were good customers who paid it back. In this project I first pre-process the data into both human-readable and machine-readable formats, then explore the data and derive inferences, and finally use it to predict whether a person will default on their loan.
Keywords - German-Credit-Risk, Machine-Learning, Predictive Modelling
The main source of income for most banks is lending to their customers. They hold people's money and pay some interest on it, and they lend to other customers at a higher interest rate. This margin between the savings interest and the loan interest is where banks make most of their money.
Every time a bank provides a loan, however, it faces the risk of the loan not being paid back. Banks generally take some type of collateral, such as a person's property. Even so, most banks would rather avoid lending to a person who will default, since they lose both money and the time value of that money. In that time they could have lent to a person who would repay.
Therefore, it is crucial to determine whether a person is likely to default or to pay back their loan before the bank provides it. In this project I pre-processed the data and then plotted graphs using the R programming language and the plotly package. From these plots I derived inferences, and finally I used the data to build a machine learning model that predicts whether a person will default.
I also wish to make a Shiny web application that takes all of the required data and predicts whether a person can be provided with a loan.
PROJECT: https://github.com/suryasashankgundepudi/german-credit-risk-modelling
SHINY WEB APP: YET TO BE DEVELOPED
This dataset was provided by Dr. Hans Hofmann from the University of Hamburg (Universität Hamburg). It is publicly available for data scientists to use at the UCI Machine Learning Repository. The direct link to the dataset, with both the numeric and the original data, is at STATLOG-GERMAN-CREDIT.
The dataset contains anonymized records of 1000 customers who have either defaulted on their bank loan or paid back their credit duly. It contains 20 attributes, 7 of which are numerical and 13 of which are categorical. These attributes contain relevant information about the customer. They have been listed below:
The target variable is the outcome of the risk taken by the bank. In the raw file (column V21 in the table of raw data below) it is coded 1 if the risk taken was good and the person was not a defaulter, and 2 if the person was a defaulter.
In this section I cover some of the basic data pre-processing techniques I employed to arrive at more understandable and descriptive data.
The data was first read from the UCI Machine Learning Repository. The required package for this step is RCurl.
I saved this data into a new directory for further processing.
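The original chunk for this step is not reproduced here, so the following is only a minimal sketch of how the download could look with RCurl. The URL, the helper names `parse_german_credit` and `download_german_credit`, and the output path `data/german-raw.csv` are assumptions for illustration, not the project's actual code.

```r
# Assumed location of the raw file on the UCI server; verify before use.
german_url <- paste0(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/",
  "statlog/german/german.data"
)

# Parse the raw text: whitespace-separated values with no header row,
# so R names the columns V1 ... V21.
parse_german_credit <- function(raw_text) {
  read.table(text = raw_text, header = FALSE, stringsAsFactors = FALSE)
}

# Download with RCurl, parse, and save a local copy.
# The output path is hypothetical.
download_german_credit <- function(url = german_url,
                                   out_file = "data/german-raw.csv") {
  raw_text <- RCurl::getURL(url)
  credit <- parse_german_credit(raw_text)
  dir.create(dirname(out_file), showWarnings = FALSE, recursive = TRUE)
  write.csv(credit, out_file, row.names = FALSE)
  credit
}

# Offline check of the parser on the first record of the raw data.
sample_line <- paste(
  "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4",
  "A121 67 A143 A152 2 A173 1 A192 A201 1"
)
first_row <- parse_german_credit(sample_line)
str(first_row)
```

Calling `download_german_credit()` needs network access and the RCurl package; the parser itself only uses base R.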
The table below shows how the data looks without any kind of pre-processing:
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | 4 | A121 | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 |
| A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | 2 | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
| A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | 3 | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
| A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | 4 | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
| A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | 4 | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |
As you can see, the raw data has no column names, and the categorical values are stored as opaque codes (such as A11 or A34) rather than descriptive labels.
Since the raw data cannot be used directly for exploratory data analysis, I used switch-case style code to give a descriptive name to every value in the data. For reference I used the German.doc file provided at the UCI Machine Learning Repository, which contains a detailed description of every attribute and what each value means. For a better understanding of the rudimentary yet robust code I employed, please visit the project page at German-credit-Risk-Repo.
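As a rough illustration of this decoding step, a named vector can serve as a lookup table. This is a sketch rather than the repository's switch-case code: `checking_labels` and `decode_column` are hypothetical names, and the label for code A13 is an assumption because it does not appear in the rows shown in this write-up.

```r
# Hypothetical lookup table for the first column (checking account status).
checking_labels <- c(
  A11 = "< 0",
  A12 = "0 <= Checking < 200",
  A13 = ">= 200",               # assumed label, not shown in this write-up
  A14 = "No Checking account"
)

# Replace codes with labels by indexing the named vector with the codes.
# Codes missing from the lookup table come back as NA.
decode_column <- function(codes, labels) {
  unname(labels[as.character(codes)])
}

# Codes from column V1 of the first five raw rows.
raw_codes <- c("A11", "A12", "A14", "A11", "A11")
decoded <- decode_column(raw_codes, checking_labels)
print(decoded)
```

One lookup vector per categorical column keeps the mapping in data rather than in branching code, which makes it easier to check against the attribute description file.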
After cleaning the data for exploratory data analysis, I arrived at a more descriptive dataset:
| Checking.Account | Duration | Credit.History | Purpose | Credit.Amount | Savings.Account.Bonds | Present.employee |
|---|---|---|---|---|---|---|
| < 0 | 6 | critical account/other credits existing (not at this bank) | Radio or Television | 1169 | Unkown/No Savings Account | Exp >= 7 |
| 0 <= Checking < 200 | 48 | existing credits paid back duly till now | Radio or Television | 5951 | Less than 100 | 1 <= Exp < 4 |
| No Checking account | 12 | critical account/other credits existing (not at this bank) | Education | 2096 | Less than 100 | 4 <= Exp <7 |
| < 0 | 42 | existing credits paid back duly till now | Furniture/Equipment | 7882 | Less than 100 | 4 <= Exp <7 |
| < 0 | 24 | delay in paying off in the past | New Car | 4870 | Less than 100 | 1 <= Exp < 4 |
However, for the later part of this project I also had to make sure the data was in a machine-readable format. This was done using the superml package available in R. The following R code chunk converts the character-based data frame into numeric data for predictive modelling.
# Load the superml package, which provides LabelEncoder
library(superml)

# Reading the clean data.
data <- read.csv("data/eda-german-credit.csv")
# Names of the qualitative (categorical) columns
catColumns <- c("Checking.Account", "Credit.History", "Purpose",
                "Savings.Account.Bonds", "Present.employee",
                "Other.Debters", "Property", "Other.Installment.plans",
                "Housing", "Job", "Telephone", "Foreign.Worker", "Outcome",
                "Sex", "Personal.Status")
# Work on a copy so the descriptive data stays untouched
tf_data <- data.frame(data)
# Transforming each qualitative column into numerical labels
for (column in catColumns) {
  label <- LabelEncoder$new()
  tf_data[, column] <- label$fit_transform(tf_data[, column])
}
# Saving the data into a new file for later use
write.csv(tf_data, "data/machine-ready-credit.csv", row.names = FALSE)
After conversion to a machine-readable format, the data looked like this. As expected, it no longer contains any character variables.
| Checking.Account | Duration | Credit.History | Purpose | Credit.Amount | Savings.Account.Bonds | Present.employee |
|---|---|---|---|---|---|---|
| 0 | 6 | 0 | 0 | 1169 | 0 | 0 |
| 1 | 48 | 1 | 0 | 5951 | 1 | 1 |
| 2 | 12 | 0 | 1 | 2096 | 1 | 2 |
| 0 | 42 | 1 | 2 | 7882 | 1 | 2 |
| 0 | 24 | 2 | 3 | 4870 | 1 | 1 |
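The rows shown in the two tables are consistent with each label receiving a 0-based integer in order of first appearance. The base R sketch below reproduces that behaviour for the first column; it is an illustration under that assumption, not the internals of the encoder used in the project, and `encode_first_seen` is a hypothetical helper.

```r
# Encode labels as 0-based integers in order of first appearance.
encode_first_seen <- function(x) {
  match(x, unique(x)) - 1L
}

# Checking.Account values from the first five descriptive rows.
checking <- c("< 0", "0 <= Checking < 200", "No Checking account",
              "< 0", "< 0")
encoded <- encode_first_seen(checking)
print(encoded)
# 0 1 2 0 0
```

These integers are arbitrary identifiers. They carry no ordering, which matters for models that treat numeric inputs as ordered quantities.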
Now that the data is more descriptive, I plotted various graphs, most of them interactive, to derive inferences. This data analysis module has several major parts.
Each of these categories aims to provide a better understanding of how the data is distributed across various demographics. I have also included some miscellaneous plots which I thought would help me derive more inferences.
I first wanted to understand the distribution of the target variable (whether the loan provided was a good decision or a bad one). The bar graph shown below makes this clearer.
From this graph we see that there is a class imbalance in our target variable. Although fewer customers defaulted on their loan than paid back their credit duly, the share of defaults is still high, and our aim is to reduce the number of loan decisions that end in default.
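The imbalance can be quantified with a frequency table. The sketch below rebuilds the outcome column from the documented class split of this dataset (700 good and 300 bad customers) instead of reading the project's file; the variable names are illustrative.

```r
# Rebuild the outcome column from the documented 700/300 class split.
outcome <- factor(c(rep("Good", 700), rep("Bad", 300)),
                  levels = c("Good", "Bad"))

counts <- table(outcome)        # absolute counts per class
shares <- prop.table(counts)    # share of each class
imbalance_ratio <- counts[["Good"]] / counts[["Bad"]]

print(counts)
print(round(shares, 2))
print(round(imbalance_ratio, 2))
```

With roughly 2.3 good customers per defaulter, a model that always predicts "Good" would already be right 70% of the time, which is why accuracy alone is a weak yardstick here.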
To get an idea of how the population of our dataset is distributed, I plotted a histogram that shows the age distribution for each of the two genders and for the dataset as a whole.
Most people who take a loan are in their 20s and 30s. This holds irrespective of gender, as can be seen in the overall distribution.
The next plot shows the reasons why men and women take a loan. To visualize this I plotted a horizontal grouped bar graph that shows the distribution of men and women across the various purposes.
The graph takes the percentage of men and of women for each purpose and plots them side by side. From this it is inferred that, for all categories other than furniture and domestic appliances, a higher percentage of men than women take a loan.
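The percentages behind such a grouped bar graph can be computed with `prop.table`. This sketch assumes the percentages are taken within each gender (each gender's column sums to 100); the text does not say which normalisation was used, and the six records below are made up for illustration.

```r
# Toy records; the real columns are Sex and Purpose in the cleaned data.
sex <- c("Male", "Male", "Male", "Male", "Female", "Female")
purpose <- c("New Car", "Radio or Television", "New Car", "Education",
             "Radio or Television", "New Car")

# Cross-tabulate, then convert each gender's column to percentages.
counts <- table(purpose, sex)
pct <- round(100 * prop.table(counts, margin = 2), 1)
print(pct)
```

Using `margin = 1` instead would give the share of men and women within each purpose, which answers a different question.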
The next graph shows the distribution of the credit amount for men and women. The x axis shows the amount in Deutsche Mark (DM) and the y axis shows the count of customers.
It could be hypothesized that a person's gender does not affect their credit amount, and that the majority of the population has a credit amount between 1000 DM and 2500 DM.
Finally, for the gender analysis, I looked at whether gender affects a person's loan outcome. The next plot shows the count of men and women with good and bad outcomes.
The plot suggests that men in general have a higher ratio of good to bad outcomes than women. However, the data might not be fully representative of the general population, as there is an imbalance between the number of men and women.
Summary of Gender Analysis
In the age analysis module I looked at whether the various age groups carry higher or lower risk. I also look at the credit amount distribution, but I do not examine outliers closely in this analysis.
The population has been split equally into 4 major age groups.
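One way to obtain four equally sized groups is to cut the ages at their quartiles. Whether the project used quartiles or fixed age boundaries is not stated, so this is a sketch under the quartile assumption; the ages and the group labels are illustrative.

```r
# Illustrative ages; the first five match the rows shown earlier.
ages <- c(67, 22, 49, 45, 53, 35, 31, 39, 61, 28, 25, 24)

# Quartile boundaries: minimum, 25%, 50%, 75%, maximum.
breaks <- quantile(ages, probs = seq(0, 1, by = 0.25))

# Assign each age to its quartile; include.lowest keeps the minimum age.
# The labels are placeholders, not the project's group names.
age_group <- cut(ages, breaks = breaks, include.lowest = TRUE,
                 labels = c("Group 1 (youngest)", "Group 2",
                            "Group 3", "Group 4 (oldest)"))
print(table(age_group))
```

With many tied ages the groups can come out slightly unequal, and `cut` raises an error if two quartile boundaries coincide.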
For the first plot I drew a stacked histogram of the age distribution for people with good and bad credit.
It can be inferred from this graph that bad risk is concentrated in the younger population. The distribution for good credit is also right-skewed, with most good-credit customers in their late 20s and early 30s.
The next graph is a box plot of credit amount for the various age groups. It shows whether some age groups borrow larger or smaller amounts than others.
Young adults and adults have higher credit amounts than the other age groups. The plot also shows that, in general, people with lower credit amounts have bad outcomes.
Another representation of the same data is the violin plot shown below. It leads to similar inferences as the box plot.
Finally, I plotted a stacked bar graph of good and bad loans for the different age groups to understand the ratio of good to bad credit.
From this graph it is understood that young adults have the highest ratio of good to bad risk outcomes. Seniors, surprisingly, have the lowest ratio of good to bad risk outcomes.
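The ratio of good to bad outcomes per group, which several of these comparisons rely on, can be read off a contingency table. The sketch below uses made-up records and a hypothetical helper `good_bad_ratio`; it does not reproduce the project's numbers.

```r
# Ratio of good to bad outcomes within each group.
good_bad_ratio <- function(group, outcome) {
  counts <- table(group, outcome)
  counts[, "Good"] / counts[, "Bad"]
}

# Made-up records for illustration only.
age_group <- c("Young", "Young", "Young", "Adult", "Adult",
               "Adult", "Adult", "Senior", "Senior", "Senior")
outcome <- c("Good", "Bad", "Bad", "Good", "Good",
             "Good", "Bad", "Good", "Good", "Bad")

ratios <- good_bad_ratio(age_group, outcome)
print(ratios)
```

A group with no bad outcomes yields `Inf`, so small groups should be read with care.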
Summary of Age Analysis
Here I try to understand how people from different wealth classes are distributed in our dataset.
Surprisingly, people in the higher savings brackets have a lower ratio of good to bad outcomes than people in the lower savings brackets. The exception is the highest savings bracket, which has the highest ratio of good to bad credit.
A similar distribution can be seen across the different types of credit history.
In the data provided, the job attribute is split into different levels of skill and industry. In this analysis I plotted 2 different plots, one of which shows the different types of job against their credit amount.
It can be postulated from the two graphs above that self-employed or highly qualified professionals have high counts of both good and bad outcomes. The people with high credit amounts are also mostly in this group. In my opinion, since this category includes self-employed people, they might take loans for their businesses, and some of these businesses might not have been able to pay the loan back. This might also explain why their credit amounts are so high.
Summary of Wealth and Job Analysis
The following two graphs mainly look at how the different housing types are distributed across good and bad outcomes.
We can see that people who live for free have the lowest ratio of good to bad outcomes, and that home owners have the highest.
Finally, to understand why people take out loans, I plotted box plots of the credit amount for each purpose. The graph is shown below.
People who take out loans for other purposes borrow the highest amounts. The next highest amounts are borrowed by people taking car loans. People who take out loans for domestic appliances borrow the lowest amounts.
This concludes the exploratory data analysis section. We will now move on to predictive modelling using various machine learning techniques.
In this section I employed various machine learning algorithms to classify whether a person will default on their loan. For predictive modelling I used the Python programming language because of its better support for ML algorithms.
The algorithms I used include decision trees, logistic regression, random forest, XGBoost, quadratic discriminant analysis, and support vector classifiers.
We will be looking at the precision, recall and F1 score for these algorithms. The code for this can be found in the interactive Python notebook at the project repository.
| NAME.OF.ML.ALGORITHM.USED | PRECISION.0 | PRECISION.1 | RECALL.0 | RECALL.1 | F1.SCORE.0 | F1.SCORE.1 |
|---|---|---|---|---|---|---|
| DECISION TREES | 0.77 | 0.39 | 0.71 | 0.46 | 0.74 | 0.42 |
| LOGISTIC REGRESSION | 0.79 | 0.65 | 0.91 | 0.42 | 0.85 | 0.51 |
| RANDOM FOREST | 0.78 | 0.66 | 0.92 | 0.38 | 0.85 | 0.48 |
| XGBOOST | 0.82 | 0.69 | 0.9 | 0.51 | 0.86 | 0.59 |
| QUADRATIC DISCRIMINANT ANALYSIS | 0.83 | 0.57 | 0.82 | 0.58 | 0.82 | 0.58 |
| SUPPORT VECTOR CLASSIFIERS | 0.77 | 0.66 | 0.94 | 0.29 | 0.84 | 0.40 |
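For readers who want to see how the per-class columns in the table are defined, here is a small base R sketch. The project computed its metrics in a Python notebook, so this is only an illustration: `class_metrics` is a hypothetical helper and the label vectors are made up. The last lines check that one table entry is internally consistent.

```r
# Precision, recall and F1 for one class treated as the "positive" class.
class_metrics <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)  # true positives
  fp <- sum(predicted == positive & actual != positive)  # false positives
  fn <- sum(predicted != positive & actual == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Made-up labels for illustration: six of class 0, four of class 1.
actual    <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
predicted <- c(0, 0, 0, 0, 0, 1, 1, 1, 0, 0)

metrics_class0 <- class_metrics(actual, predicted, positive = 0)
metrics_class1 <- class_metrics(actual, predicted, positive = 1)
print(round(rbind(class0 = metrics_class0, class1 = metrics_class1), 2))

# Consistency check on the table: XGBoost, class 1.
# Precision 0.69 and recall 0.51 give an F1 of about 0.59.
xgb_f1_class1 <- round(2 * 0.69 * 0.51 / (0.69 + 0.51), 2)
print(xgb_f1_class1)
```

Because F1 is the harmonic mean of precision and recall, a model with high precision but low recall on the minority class (as with the support vector classifier above) still ends up with a low F1 for that class.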
Although the results are not great, I hope to implement a fine-tuned deep learning model that gives better results.
The German Credit data was read from the UCI Machine Learning Repository. Initial data pre-processing was carried out to produce clean and understandable data. The data was then used to perform exploratory data analysis and derive inferences. Finally, the machine-ready data was scaled and used to predict whether a person would default on their loan. Various machine learning algorithms were used for this purpose, and the XGBoost model performed best among them.
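The scaling step mentioned above is not detailed in this write-up. A common choice is standardisation to zero mean and unit standard deviation, sketched here in base R on the two numeric columns from the rows shown earlier. Which scaler the project actually used is an assumption.

```r
# Numeric columns from the first five rows of the cleaned data.
numeric_cols <- data.frame(
  Duration = c(6, 48, 12, 42, 24),
  Credit.Amount = c(1169, 5951, 2096, 7882, 4870)
)

# Standardise each column: subtract its mean, divide by its
# standard deviation.
scaled <- as.data.frame(scale(numeric_cols))

print(round(scaled, 2))
print(round(colMeans(scaled), 10))   # approximately 0 for every column
print(apply(scaled, 2, sd))          # 1 for every column
```

In a modelling pipeline the means and standard deviations should be estimated on the training split only and then reused for the test split.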
Right now I am a little busy with my senior year at college, but in the future I wish to build a deep learning model for this data. The trained models could also be deployed in a Shiny web application. Most of all, I wish to implement the machine learning algorithms in R. Since the data is not fully representative, one could also apply data augmentation to make it more representative.